Provenance Management in Practice

Author

  • MATTHIJS OOMS
Abstract

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are used by bioinformaticians to design and run data-intensive experiments that connect local and remote (Web) services and tools. Preserving data for later inspection or reuse, determining the quality of results, and validating results are essential for scientific experiments. All of this can be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research has two provenance-related goals: 1. Using a provenance archive effectively and efficiently as a cache for workflow tasks. 2. Designing techniques to support browsing and navigation through a provenance archive. Early in this research it was determined that a representative use case was needed. A use case in the form of a scientific workflow can show the performance improvements that may be gained by caching workflow tasks. If the use case is large-scale and data-intensive, and provenance is collected during its execution, it can also show the levels of detail that can be addressed in the provenance data. Different levels of detail can help when browsing and navigating provenance data. The use case identified is OligoRAP, taken from the life science domain. OligoRAP was cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured, and its results were validated by comparing them to the results of the original Perl implementation. Casting OligoRAP as a workflow and using parallelism improved its performance by a factor of two. Many improvements were made to e-BioFlow in order to run OligoRAP, including a new provenance implementation based on the OPM, which enables provenance capture during the execution of OligoRAP in e-BioFlow.
During this research, e-BioFlow has grown from a proof of concept into a powerful research prototype. For the OPM implementation, a profile for the OPM has been proposed that defines how provenance data is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data. With this profile, interoperability of the OPM for SWfMSs is improved. A caching strategy for workflow tasks is proposed and implemented in e-BioFlow. It queries the OPM implementation for previous task executions. The queries are optimised by reformulating them and creating several indices. The performance improvement …
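The caching idea described in the abstract (look up a previous execution of the same task before running it again) can be illustrated with a minimal sketch. This is not the e-BioFlow or OPM implementation: the `ProvenanceCache` class, the `executions` table, the `input_digest` helper, and the `reverse_complement` task are all hypothetical names chosen for illustration, and the sketch assumes tasks are deterministic and that their inputs and outputs are JSON-serialisable.

```python
import hashlib
import json
import sqlite3


def input_digest(inputs):
    """Stable digest of a task's inputs (assumes they are JSON-serialisable)."""
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceCache:
    """Toy provenance archive that doubles as a task cache (hypothetical schema)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE executions (task TEXT, digest TEXT, output TEXT)"
        )
        # Index on the lookup columns so cache queries avoid a full table scan.
        self.db.execute(
            "CREATE INDEX idx_task_digest ON executions (task, digest)"
        )

    def run(self, task, func, inputs):
        """Return (output, was_cached); call func only on a cache miss."""
        digest = input_digest(inputs)
        row = self.db.execute(
            "SELECT output FROM executions WHERE task = ? AND digest = ? LIMIT 1",
            (task, digest),
        ).fetchone()
        if row is not None:
            return json.loads(row[0]), True
        output = func(**inputs)
        # Record the execution so that later runs can reuse its output.
        self.db.execute(
            "INSERT INTO executions VALUES (?, ?, ?)",
            (task, digest, json.dumps(output)),
        )
        return output, False


calls = []  # records real executions, to show the second run is served from cache


def reverse_complement(sequence):
    """Stand-in for an expensive workflow task."""
    calls.append(sequence)
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


cache = ProvenanceCache()
first = cache.run("reverse_complement", reverse_complement, {"sequence": "ATGC"})
second = cache.run("reverse_complement", reverse_complement, {"sequence": "ATGC"})
print(first, second, len(calls))
```

In this sketch the second call returns the stored output without invoking the task again. The composite index on `(task, digest)` stands in for the kind of index the abstract mentions; the actual queries and indices used in e-BioFlow are not given in this text.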

Similar resources

Tackling the Provenance Challenge one layer at a time

VisTrails is a new workflow and provenance management system that provides support for scientific data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature, change is the norm. VisTrails uses a new change-based provenance mechanism which was designed to handle rapidly-evolving workflows. It un...

Provenance, Lineage, and Workflows

In computer science, provenance, also known as lineage and pedigree, describes the source and derivation of data. Data provenance is key to the management of scientific data and has recently been recognized as central to the trust one places in data. This paper focuses attention on the importance and difficulty of provenance tracking in practice. We discuss a taxonomy of data provenance characterist...

Provenance-Aware Faceted Search in Drupal

As web content is increasingly generated in more diverse situations, provenance is becoming more and more critical. While a variety of approaches have been investigated for capturing and making use of provenance metadata, arguably no single best-practice approach has emerged. In this paper, we investigate an approach that leverages one of the most popular content management systems – Drupal...

Do You Know Where Your Data's Been? - Tamper-Evident Database Provenance

Database provenance chronicles the history of updates and modifications to data, and has received much attention due to its central role in scientific data management. However, the use of provenance information still requires a leap of faith. Without additional protections, provenance records are vulnerable to accidental corruption, and even malicious forgery, a problem that is most pronounced ...

On Explicit Provenance Management in RDF/S Graphs

The notion of RDF Named Graphs has been proposed in order to assign provenance information to data described using RDF triples. In this paper, we argue that named graphs alone cannot capture provenance information in the presence of RDFS reasoning and updates. In order to address this problem, we introduce the notion of RDF/S Graphsets: a graphset is associated with a set of RDF named graphs an...

Provenance Capture and Use: A Practical Guide

There is a widespread recognition across MITRE’s sponsors of the importance of capturing the provenance of information (sometimes called lineage or pedigree). However, the technology for supporting capture and usage of provenance is relatively immature. While there has been much research, few commercial capabilities exist. In addition, there is neither a commonly understood concept of operation...

Publication date: 2009